This quickstart guide explains how to join two tables A and B using the TF-IDF similarity measure. First, you need to import the required packages as follows (if you have installed py_stringsimjoin, the dependencies py_stringmatching and pandas will have been installed automatically):


In [1]:
# Import libraries
import py_stringsimjoin as ssj
import py_stringmatching as sm
import pandas as pd
import os
import sys

In [3]:
print('python version: ' + sys.version)
print('py_stringsimjoin version: ' + ssj.__version__)
print('py_stringmatching version: ' + sm.__version__)
print('pandas version: ' + pd.__version__)


python version: 2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
py_stringsimjoin version: 0.1.0
py_stringmatching version: 0.2.0
pandas version: 0.16.2

Joining two tables using the TF-IDF measure typically consists of six steps:

  1. Loading the input tables
  2. Profiling the tables
  3. Creating a tokenizer
  4. Applying overlap filter
  5. Creating the corpus for TF-IDF matcher
  6. Applying the TF-IDF matcher

1. Loading the input tables

We begin by loading the two tables. For the purpose of this guide, we use the books dataset that comes with the package.


In [4]:
# Construct the paths of the tables to be loaded. Since we are loading a 
# dataset that ships with the package, we need to access the data from the 
# path where the package is installed. If you are loading your own data, 
# you can directly provide your table's path to the read_csv command.

table_A_path = os.sep.join([ssj.get_install_path(), 'datasets', 'data', 'books_table_A.csv.gz'])
table_B_path = os.sep.join([ssj.get_install_path(), 'datasets', 'data', 'books_table_B.csv.gz'])

In [5]:
# Load csv files as dataframes. Since we are reading a compressed csv file, 
# we provide the compression argument. If you are reading an uncompressed 
# csv file, you should not specify the compression argument.

A = pd.read_csv(table_A_path, compression='gzip')
B = pd.read_csv(table_B_path, compression='gzip')
print('Number of records in A: ' + str(len(A)))
print('Number of records in B: ' + str(len(B)))


Number of records in A: 3022
Number of records in B: 3099

In [6]:
A.head(1)


Out[6]:
 | ID | Title | Price | Author | ISBN13 | Publisher | Publication_Date | Pages | Dimensions
0 | BN0001 | The Maze Runner Series Complete Collection | $24.21 | James Dashner | 9780385388894 | Random House Children's Books | 7/8/14 | NaN | 5.60(w) x 8.40(h) x 3.30(d)

In [7]:
B.head(1)


Out[7]:
 | ID | Title | UsedPrice | NewPrice | Author | ISBN10 | ISBN13 | Publisher | Publication_Date | Pages | Dimensions
0 | HC0001 | The Hunger Games(Hunger Games (Quality)) | $4.98 | $11.02 | Suzanne Collins | 0439023521 | 9780439023528 | Scholastic Press | Jul. 3rd, 2010 | 374 | 5.30 x 8.00 x 0.90 inches

2. Profiling the tables

Before performing the join, we may want to profile the tables to understand the characteristics of their attributes. This can help identify:

a) unique attributes in a table, which can be used as key attributes when performing the join. A key attribute is needed to uniquely identify a tuple.

b) the number of missing values in each attribute. This can help you decide which attribute to perform the join on. For example, an attribute with many missing values may not be a good join attribute. Further, based on the missing value information, you need to decide how to handle missing values when performing the join.

You can profile the attributes in a table using the following command:


In [28]:
# profile attributes in table A
ssj.profile_table_for_join(A)


Out[28]:
Attribute | Unique values | Missing values | Comments
ID | 3022 (100.0%) | 0 (0.0%) | This attribute can be used as a key attribute.
Title | 2859 (94.61%) | 0 (0.0%) |
Price | 698 (23.1%) | 0 (0.0%) |
Author | 1225 (40.54%) | 0 (0.0%) |
ISBN13 | 2869 (94.94%) | 0 (0.0%) |
Publisher | 200 (6.62%) | 0 (0.0%) |
Publication_Date | 1100 (36.4%) | 0 (0.0%) |
Pages | 303 (10.03%) | 12 (0.4%) | Joining on this attribute will ignore 12 (0.4%) rows.
Dimensions | 1764 (58.37%) | 20 (0.66%) | Joining on this attribute will ignore 20 (0.66%) rows.

In [26]:
# profile attributes in table B
ssj.profile_table_for_join(B)


Out[26]:
Attribute | Unique values | Missing values | Comments
ID | 3099 (100.0%) | 0 (0.0%) | This attribute can be used as a key attribute.
Title | 2913 (94.0%) | 0 (0.0%) |
UsedPrice | 320 (10.33%) | 3 (0.1%) | Joining on this attribute will ignore 3 (0.1%) rows.
NewPrice | 883 (28.49%) | 166 (5.36%) | Joining on this attribute will ignore 166 (5.36%) rows.
Author | 1371 (44.24%) | 5 (0.16%) | Joining on this attribute will ignore 5 (0.16%) rows.
ISBN10 | 3099 (100.0%) | 0 (0.0%) | This attribute can be used as a key attribute.
ISBN13 | 3099 (100.0%) | 0 (0.0%) | This attribute can be used as a key attribute.
Publisher | 367 (11.84%) | 0 (0.0%) |
Publication_Date | 1165 (37.59%) | 0 (0.0%) |
Pages | 510 (16.46%) | 0 (0.0%) |
Dimensions | 2572 (82.99%) | 248 (8.0%) | Joining on this attribute will ignore 248 (8.0%) rows.

Based on the profile output, we find that the 'Title' attribute in both tables contains no missing values. Hence, for the purpose of this guide, we will join tables A and B on the 'Title' attribute using the TF-IDF measure. Next, we need to decide what threshold to use for the join. For this guide, we will use a threshold of 0.5. Specifically, the join will find tuple pairs from A and B such that the TF-IDF score over the 'Title' attributes is at least 0.5.

Naively, performing the join would involve enumerating the Cartesian product A x B (3022 x 3099 = 9,365,178 pairs) and computing the TF-IDF score for every pair. But this can be very time consuming. Hence, we optimize by first applying an overlap filter over tables A and B to find pairs sharing at least one token in the 'Title' attribute. The intuition here is that for the TF-IDF score to be above zero, there must be at least one common token between the attributes, so pairs with no common token can be pruned safely (see the sketch below). Finally, we apply the TF-IDF measure over the candidate pairs to obtain the join output.
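
To make this intuition concrete, here is a minimal sketch (plain Python, not part of the package API) showing that the pruning check is just token overlap:


In [ ]:
# Illustration of the pruning intuition: if two tokenized titles share no
# tokens, their TF-IDF score is necessarily zero, so the pair can be
# dropped without computing the score.
tokens_1 = set('The Maze Runner Series Complete Collection'.split())
tokens_2 = set('Looking for Alaska'.split())
print(len(tokens_1 & tokens_2) > 0)  # False -> TF-IDF score must be 0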

3. Creating a tokenizer

Since the TF-IDF measure treats input strings as bags of tokens, we need to select a tokenizer to convert each string into a bag of tokens. Currently, we support tokenizers from the py_stringmatching package, which provides five different tokenizer types: alphabetical tokenizer, alphanumeric tokenizer, delimiter-based tokenizer, qgram tokenizer, and whitespace tokenizer.

For the purpose of this guide, we will use a whitespace tokenizer. Once we have selected a tokenizer type, we need to create a tokenizer object as shown below:


In [8]:
# create whitespace tokenizer for tokenizing 'Title' attribute
ws = sm.WhitespaceTokenizer()

ws.tokenize('The Maze Runner Series Complete Collection')


Out[8]:
['The', 'Maze', 'Runner', 'Series', 'Complete', 'Collection']
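
The other tokenizer types are created in the same way. For example, a qgram tokenizer (shown here only as an illustrative alternative; it splits a string into overlapping substrings of length q) can be created as follows:


In [ ]:
# Alternative: a qgram tokenizer with q=3, useful when word boundaries
# are noisy (e.g., misspellings or missing spaces)
qg3 = sm.QgramTokenizer(qval=3)
qg3.tokenize('Maze')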

4. Applying overlap filter

Next, we need to create an overlap filter and apply it over tables A and B to obtain a candidate set of pairs C.

In [9]:
# create overlap filter with whitespace tokenizer and threshold of 1. 
of = ssj.OverlapFilter(ws, 1) 

# apply overlap filter to tables A and B to find tuple pairs 
# sharing at least 1 token in Title attribute
C = of.filter_tables(A, B, 'ID', 'ID', 'Title', 'Title', n_jobs=-1)


0%                          100%
[##############################] | ETA[sec]: 0.000 
Total time elapsed: 0.432 sec

In [10]:
len(C)


Out[10]:
1103242

In [11]:
C.head(5)


Out[11]:
_id l_ID r_ID
0 0 BN0001 HC0001
1 1 BN0002 HC0001
2 2 BN0003 HC0001
3 3 BN0007 HC0001
4 4 BN0008 HC0001

If you want to include pairs with missing values in the output, you need to set the allow_missing flag to True when creating the overlap filter, as shown below:


In [ ]:
of = ssj.OverlapFilter(ws, 1, allow_missing=True)

Now, when you apply the filter, pairs with missing values will also be included in the output.
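
For instance, the missing-value-tolerant filter can be applied with the same filter_tables call as before (a sketch; in this dataset 'Title' has no missing values, so the output should match C):


In [ ]:
# Apply the filter that retains pairs with missing values. A pair where
# either 'Title' is missing cannot be pruned safely (its token overlap
# is unknown), so it is passed through to the candidate set.
of_with_missing = ssj.OverlapFilter(ws, 1, allow_missing=True)
C_with_missing = of_with_missing.filter_tables(A, B, 'ID', 'ID',
                                               'Title', 'Title', n_jobs=-1)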

5. Creating the corpus for TF-IDF matcher

The next step is to create the corpus required for the TF-IDF measure. Specifically, the corpus consists of the token lists of the 'Title' attribute values from both tables. The corpus can be created as follows:


In [12]:
# create a list of tokens
A_tokens = A['Title'].apply(ws.tokenize).tolist()
B_tokens = B['Title'].apply(ws.tokenize).tolist()

# merge both the lists of tokens to create the corpus
corpus = A_tokens + B_tokens
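
Each corpus entry is the token list of one title. You can verify that the corpus covers both tables (len(A) + len(B) = 6121 lists) and inspect an entry:


In [ ]:
# Inspect the corpus: one token list per title, first those of A, then
# those of B; corpus[0] holds the tokens of A's first title
print(len(corpus))
corpus[0]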

6. Applying the TF-IDF matcher

Finally, you need to create and apply the TF-IDF matcher as shown below:


In [13]:
# create tf-idf object with the generated corpus
tfidf = sm.TfIdf(corpus, dampen=True)
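
Before applying the matcher to all candidate pairs, you can sanity-check the measure on a single pair. get_sim_score takes two bags of tokens and returns the normalized TF-IDF score; for the pair below it should agree with the _sim_score reported in the final output (0.510586):


In [ ]:
# Sanity check: TF-IDF similarity of two tokenized titles. Identical
# token bags score 1.0; bags with no common token score 0.0.
tfidf.get_sim_score(ws.tokenize('Thief of Always'),
                    ws.tokenize('The Book Thief'))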

In [14]:
# apply the matcher with a threshold of 0.5. This will find pairs from C 
# with TF-IDF score >= 0.5. Setting n_jobs=-1 exploits all CPU cores available.

output_pairs = ssj.apply_matcher(C, 'l_ID', 'r_ID', A, B, 'ID', 'ID', 'Title', 'Title',
                                 ws, tfidf.get_sim_score, 0.5, 
                                 l_out_attrs=['Title'], r_out_attrs=['Title'], n_jobs=-1)


0%                          100%
[##############################] | ETA[sec]: 0.000 
Total time elapsed: 16.146 sec

In [15]:
len(output_pairs)


Out[15]:
1222

In [16]:
output_pairs.head()


Out[16]:
 | _id | l_ID | r_ID | l_Title | r_Title | _sim_score
0 | 33 | BN0070 | HC0001 | The Hunger Games (Hunger Games Series #1) | The Hunger Games(Hunger Games (Quality)) | 0.565250
1 | 1665 | BN0019 | HC0003 | Looking for Alaska | Looking for Alaska | 1.000000
2 | 1885 | BN0043 | HC0006 | The Book Thief | The Book Thief | 1.000000
3 | 2020 | BN0319 | HC0006 | Thief of Always | The Book Thief | 0.510586
4 | 2449 | BN1295 | HC0006 | Midnight Thief (Midnight Thief Series #1) | The Book Thief | 0.512195

If you want to include pairs with missing values in the output, you need to set the allow_missing flag to True in the apply_matcher method.
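
For example (a sketch; note that missing-value pairs can only appear in the output if the candidate set was also built with allow_missing=True, as in the earlier filter sketch):


In [ ]:
# Re-run the matcher on the missing-value-tolerant candidate set,
# keeping pairs whose 'Title' is missing. Such pairs cannot be scored,
# so they are passed through to the output.
output_pairs_with_missing = ssj.apply_matcher(
    C_with_missing, 'l_ID', 'r_ID', A, B, 'ID', 'ID', 'Title', 'Title',
    ws, tfidf.get_sim_score, 0.5, allow_missing=True,
    l_out_attrs=['Title'], r_out_attrs=['Title'], n_jobs=-1)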